6  Demographics table with table1

Author

Lea Nehme and Zhayda Reilly

Published

May 18, 2023

library(conflicted)
conflict_prefer("filter", "dplyr", quiet = TRUE)
conflict_prefer("lag", "dplyr", quiet = TRUE)

suppressPackageStartupMessages(library(tidyverse))

# suppress "`summarise()` has grouped output by " messages
options(dplyr.summarise.inform = FALSE)

7 Introduction

In most scientific research journals, the first table in an article is commonly referred to as Table 1.

Table 1 reports descriptive statistics for the total study sample. The rows (explanatory variables) list the key study variables, which are usually those included in the final analysis [1]. The columns correspond to the levels of the outcome of interest (response variable). Cells report n (%) for categorical variables and the mean with SD, or the median, for continuous variables. A total column is usually included as well, which helps in assessing the overall sample.

Table 1 can be built without functions like table1(). For a research paper, dissertation, or clinical trial report, we might go through the time-consuming task of calling summary(), or describe() from the Hmisc package, on each variable. Then, to see a variable in relation to a categorical response/outcome variable, we would have to build a separate cross-tabulation (such as a 2 x 2 table) for each explanatory variable. The table1() function does all of this for us, but first let's discuss the packages that will be in use today.
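To make the manual route concrete, here is a minimal base R sketch of the pieces one would otherwise assemble by hand, for one categorical and one continuous variable. It assumes the melanoma data set from the boot package (introduced later in this chapter); the object names are illustrative only.

```r
# Manual "Table 1" pieces, assuming the melanoma data shipped with boot
data(melanoma, package = "boot")

# Counts of sex (rows) by status (columns)
sex_by_status <- table(sex = melanoma$sex, status = melanoma$status)

# Column percentages: share of each sex within each status group
sex_pct <- round(100 * prop.table(sex_by_status, margin = 2), 1)

# Mean and SD of age within each status group
age_mean <- tapply(melanoma$age, melanoma$status, mean)
age_sd   <- tapply(melanoma$age, melanoma$status, sd)

print(sex_by_status)
print(sex_pct)
print(round(cbind(mean = age_mean, sd = age_sd), 1))
```

Repeating these steps for every explanatory variable, and then merging the results into one formatted table, is the work that table1() takes over.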

7.1 Necessary Packages

The table1 package provides the table1() function, which creates a Table 1 as an HTML table that is easy to copy into a Word document. The htmlTable package, which offers general tools for building HTML tables, is loaded alongside it.

The boot package was created to support bootstrap analyses. It ships with numerous data sets, including clinical trial data sets, that make this possible. However, no code book accompanies the data when it is downloaded as a csv file. A code book available on GitHub describes every data set in the package [2].

#install.packages("htmlTable")
#install.packages("table1")
#install.packages("boot")

# Load libraries
library(htmlTable)
library(table1)
library(boot)

8 Data Exploration and Wrangling

Today, we will be using the melanoma data set, which consists of measurements on patients with malignant melanoma. Each patient had their tumor surgically removed between 1962 and 1977 at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Each surgery consisted of complete removal of the tumor together with about 2.5 cm of the surrounding skin. Afterwards, the thickness of the tumor was recorded, along with whether it was ulcerated. Both are important prognostic indicators, because patients with a thick or ulcerated tumor have an increased chance of death from melanoma.

# The melanoma data ship with the boot package, so no working directory
# change or file import is needed here. Reading a local CSV copy
# (commented out below) is an alternative.
# melanoma_data <- read.csv(
#   "data/melanoma.csv", header = TRUE, sep = ","
# )
data(melanoma, package = "boot")
melanoma_data <- melanoma

#Now that we loaded the raw data set, we will conduct a visual
#exploration before wrangling the data and applying any functions,
#while also considering the requirements involved in the construction
#of a table1.

summary(melanoma_data)
      time          status          sex             age            year     
 Min.   :  10   Min.   :1.00   Min.   :0.000   Min.   : 4.0   Min.   :1962  
 1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:42.0   1st Qu.:1968  
 Median :2005   Median :2.00   Median :0.000   Median :54.0   Median :1970  
 Mean   :2153   Mean   :1.79   Mean   :0.385   Mean   :52.5   Mean   :1970  
 3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:65.0   3rd Qu.:1972  
 Max.   :5565   Max.   :3.00   Max.   :1.000   Max.   :95.0   Max.   :1977  
   thickness         ulcer      
 Min.   : 0.10   Min.   :0.000  
 1st Qu.: 0.97   1st Qu.:0.000  
 Median : 1.94   Median :0.000  
 Mean   : 2.92   Mean   :0.439  
 3rd Qu.: 3.56   3rd Qu.:1.000  
 Max.   :17.42   Max.   :1.000  

Let us now explore the type of variables within the data set.

typeof(melanoma_data$status) 
[1] "double"
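Checking one column at a time becomes tedious with more variables. As a small sketch, assuming the raw melanoma data from the boot package, the storage type of every column can be inspected in one call; column_types is an illustrative name.

```r
data(melanoma, package = "boot")

# Storage type of every column at once
column_types <- sapply(melanoma, typeof)
print(column_types)

# Compact overview of each column: class and first few values
str(melanoma)
```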

We will first build a basic table1 to illustrate how the function works. Currently, all the variables are stored as numeric (double) values. For a basic table1, however, the dependent/response variable of interest must be converted to a categorical variable (factor).

Our main variable of interest (dependent/response) is status. According to the code book found on GitHub, status is coded into three levels that indicate the patient’s status at the end of the study. Level 1 indicates that the patient had died from melanoma, Level 2 that the patient was still alive at the conclusion of the study, and Level 3 that the patient had died from causes unrelated to melanoma. As such, we will convert the “status” variable into a factor with three levels.

With this in mind, let us convert status into a factor variable with three levels. For ease of analysis we will use 2 = “Alive” as the reference level. This can be done in two ways:

  1. Although more time-consuming, it is highly recommended that beginners first call as.factor() and then recode_factor() (from the dplyr package) to minimize errors.
  2. Once you are more skilled and understand how factor() works, you can do everything in one step: factor() accepts both levels and labels in a single call, so the work does not have to be split across several functions.
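The one-step route in option 2 might look like the following sketch. It assumes the raw status codes from the boot package (1, 2, 3) and lists level 2 first so that “Alive” becomes the reference level; status_factor is an illustrative name and is not used elsewhere in this chapter.

```r
data(melanoma, package = "boot")

# One call: `levels` fixes the order (first = reference), `labels` names them
status_factor <- factor(
  melanoma$status,
  levels = c(2, 1, 3),
  labels = c("Alive", "Died from melanoma", "Non-Melanoma death")
)

levels(status_factor)
table(status_factor)
```

The labels are matched to the levels by position, so the two vectors must be written in the same order; this is the step where mistakes are most likely.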

For our example we will use as.factor() followed by recode_factor(), with 2 = “Alive” as our reference group.

melanoma_data$status <-
  as.factor(melanoma_data$status)

# print the first six observations
head(melanoma_data$status)
[1] 3 3 2 3 1 1
Levels: 1 2 3

# Recode
melanoma_data$status <- recode_factor(
  melanoma_data$status, 
  "2" = "Alive", # this is the reference group
  "1" = "Died from melanoma",
  "3" = "Non-Melanoma death"
)

# Print the first six observations
head(melanoma_data$status)
[1] Non-Melanoma death Non-Melanoma death Alive              Non-Melanoma death
[5] Died from melanoma Died from melanoma
Levels: Alive Died from melanoma Non-Melanoma death

As you can see in the variable levels, “Alive” is the reference level. Choosing a reference level deliberately is important: it determines which group comes first in the table and highlights the outcome of interest for your hypothesis, which lays the foundation of a well-organized table.
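In R, the reference level is simply the first level of a factor, so it can be checked with levels() and changed with relevel(). The sketch below uses a small toy factor (not the melanoma data) to show this; the object names are illustrative.

```r
# Toy factor; levels are sorted alphabetically, so "Alive" comes first
outcome <- factor(c("Died", "Alive", "Alive", "Died", "Alive"))
levels(outcome)[1]   # the current reference level

# Make "Died" the reference level instead; the data values are unchanged
outcome_releveled <- relevel(outcome, ref = "Died")
levels(outcome_releveled)
```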

9 Basic table 1

Now that our main variable of interest is a factor with three levels, we will run a basic table1 with the independent/explanatory variables of interest: sex, age, ulcer, and thickness.

Recall that the explanatory variables of interest are still stored as “double”. Conveniently, table1() can report results by level before the independent variables are converted to factors and labeled: wrapping a variable in factor() inside the formula is enough. This applies only to independent variables stored as numbers in which each number represents a group. For instance, although 0 is stored as a number, we know it denotes a group, such as female.

Independent variables wrapped in factor() are summarized by the number of cases (observations) and the percentage in each level. Continuous variables are summarized by the mean, SD, median, minimum, and maximum.

basic_table1 <- table1( 
  ~ factor(sex) + age + factor(ulcer) + thickness | status, 
  data = melanoma_data
)

basic_table1
                     Alive               Died from melanoma  Non-Melanoma death  Overall
                     (N=134)             (N=57)              (N=14)              (N=205)
factor(sex)
  0                  91 (67.9%)          28 (49.1%)          7 (50.0%)           126 (61.5%)
  1                  43 (32.1%)          29 (50.9%)          7 (50.0%)           79 (38.5%)
age
  Mean (SD)          50.0 (15.9)         55.1 (17.9)         65.3 (10.9)         52.5 (16.7)
  Median [Min, Max]  52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]   54.0 [4.00, 95.0]
factor(ulcer)
  0                  92 (68.7%)          16 (28.1%)          7 (50.0%)           115 (56.1%)
  1                  42 (31.3%)          41 (71.9%)          7 (50.0%)           90 (43.9%)
thickness
  Mean (SD)          2.24 (2.33)         4.31 (3.57)         3.72 (3.63)         2.92 (2.96)
  Median [Min, Max]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]  1.94 [0.100, 17.4]

If we do not wrap a numerically coded grouping variable in factor(), the following happens:

wrong_table1 <- table1(
  ~ sex + age + ulcer + thickness | status, 
  data = melanoma_data
)

wrong_table1
                     Alive               Died from melanoma  Non-Melanoma death  Overall
                     (N=134)             (N=57)              (N=14)              (N=205)
sex
  Mean (SD)          0.321 (0.469)       0.509 (0.504)       0.500 (0.519)       0.385 (0.488)
  Median [Min, Max]  0 [0, 1.00]         1.00 [0, 1.00]      0.500 [0, 1.00]     0 [0, 1.00]
age
  Mean (SD)          50.0 (15.9)         55.1 (17.9)         65.3 (10.9)         52.5 (16.7)
  Median [Min, Max]  52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]   54.0 [4.00, 95.0]
ulcer
  Mean (SD)          0.313 (0.466)       0.719 (0.453)       0.500 (0.519)       0.439 (0.497)
  Median [Min, Max]  0 [0, 1.00]         1.00 [0, 1.00]      0.500 [0, 1.00]     0 [0, 1.00]
thickness
  Mean (SD)          2.24 (2.33)         4.31 (3.57)         3.72 (3.63)         2.92 (2.96)
  Median [Min, Max]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]  1.94 [0.100, 17.4]

As you can see above, the summaries reported for the categorical explanatory variables are not appropriate.

For example, for sex we expect to see the number of individuals who are male or female, but instead we get the mean, which is not a proper descriptive statistic because sex is a categorical variable.

To avoid this issue, as well as problems in other procedures (like logistic regression), it is crucial to convert such variables to factors before running any function on them.
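One way to catch such variables early is to flag numeric columns that take only a handful of distinct values. The sketch below is an assumption-laden convenience, not part of table1: looks_categorical() is a hypothetical helper, the cutoff of 5 distinct values is arbitrary, and it is run on the raw melanoma data from the boot package.

```r
data(melanoma, package = "boot")

# Hypothetical helper: names of numeric columns with at most `max_levels`
# distinct non-missing values, which are likely coded categories
looks_categorical <- function(df, max_levels = 5) {
  is_candidate <- vapply(
    df,
    function(x) is.numeric(x) && length(unique(x[!is.na(x)])) <= max_levels,
    logical(1)
  )
  names(df)[is_candidate]
}

candidates <- looks_categorical(melanoma)
print(candidates)
```

The flagged names are only candidates; the code book remains the authority on which variables are truly categorical.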

As mentioned, two of the independent/explanatory variables of interest need to be converted to factor variables: sex and ulcer. We then label each level of these variables.

According to the code book, sex is the patient’s sex (1 = male, 0 = female) and ulcer is an indicator of ulceration (1 = present, 0 = absent).

typeof(melanoma_data$sex)
[1] "double"

melanoma_data$sex <- as.factor(melanoma_data$sex)

# print the first six observations
head(melanoma_data$sex)
[1] 1 1 1 0 1 1
Levels: 0 1

# Recode
melanoma_data$sex <- recode_factor(
  melanoma_data$sex, 
  "0" = "Female",
  "1" = "Male"
)

# Print the first six observations
head(melanoma_data$sex)
[1] Male   Male   Male   Female Male   Male  
Levels: Female Male

typeof(melanoma_data$ulcer)
[1] "double"

melanoma_data$ulcer <- as.factor(melanoma_data$ulcer)

# print the first six observations
head(melanoma_data$ulcer)
[1] 1 0 0 0 1 1
Levels: 0 1

# Recode
melanoma_data$ulcer <- recode_factor(
  melanoma_data$ulcer, 
  "0" = "Absent",
  "1" = "Present"
)

# Print the first six observations
head(melanoma_data$ulcer)
[1] Present Absent  Absent  Absent  Present Present
Levels: Absent Present

In addition, we need to add units to the two continuous variables, age and thickness.

According to the code book, age is the patient’s age measured in years and thickness corresponds to the tumor’s thickness in millimeters (mm). The package table1 provides an easy way to demonstrate measurement information:

units(melanoma_data$age) <- "years"
units(melanoma_data$thickness) <- "mm"

Additionally, for visual and descriptive purposes, the table1 package lets us set the variable labels shown in the final table with the label() function. The caption and footnote arguments of table1() supply a title and a footnote for the table; here we store their text in two character variables.

label(melanoma_data$sex) <- "Sex"
label(melanoma_data$age) <- "Age"
label(melanoma_data$ulcer) <- "Ulceration"
label(melanoma_data$thickness) <- "Thickness*"

caption_char <- "Table 1. Melanoma Dataset Descriptive Statistics"
footnote_char <- "*Also known as Breslow thickness"

Below is the final table1 layout. As you can see, we no longer wrap the variables in factor(), because we already converted them to factors in the previous steps.

table1(
  ~ sex + age + ulcer + thickness | status, 
  data = melanoma_data,
  overall = c(left = "Total"), 
  caption = caption_char, 
  footnote = footnote_char
)
Table 1. Melanoma Dataset Descriptive Statistics

                     Total               Alive               Died from melanoma  Non-Melanoma death
                     (N=205)             (N=134)             (N=57)              (N=14)
Sex
  Female             126 (61.5%)         91 (67.9%)          28 (49.1%)          7 (50.0%)
  Male               79 (38.5%)          43 (32.1%)          29 (50.9%)          7 (50.0%)
Age (years)
  Mean (SD)          52.5 (16.7)         50.0 (15.9)         55.1 (17.9)         65.3 (10.9)
  Median [Min, Max]  54.0 [4.00, 95.0]   52.0 [4.00, 84.0]   56.0 [14.0, 95.0]   65.0 [49.0, 86.0]
Ulceration
  Absent             115 (56.1%)         92 (68.7%)          16 (28.1%)          7 (50.0%)
  Present            90 (43.9%)          42 (31.3%)          41 (71.9%)          7 (50.0%)
Thickness* (mm)
  Mean (SD)          2.92 (2.96)         2.24 (2.33)         4.31 (3.57)         3.72 (3.63)
  Median [Min, Max]  1.94 [0.100, 17.4]  1.36 [0.100, 12.9]  3.54 [0.320, 17.4]  2.26 [0.160, 12.6]

*Also known as Breslow thickness

10 Conclusion

In conclusion, Table 1 is one of the most widely used tools in scientific research. Understanding how to use the table1 package in R can be of benefit to many.

It is important to note that this presentation is just a brief summary of what is possible with this package. For example, you can change the table’s appearance with the topclass argument of table1(), which supports built-in styles such as zebra, grid, shade, times, and center.
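As a brief sketch of the styling option, the call below requests the zebra style through the "Rtable1-zebra" class name. It assumes the table1 package is installed and starts again from the raw melanoma data in the boot package; melanoma_styled and zebra_table are illustrative names.

```r
library(table1)
data(melanoma, package = "boot")

# Work on a copy and convert the response to a labeled factor
melanoma_styled <- melanoma
melanoma_styled$status <- factor(
  melanoma_styled$status,
  levels = c(2, 1, 3),
  labels = c("Alive", "Died from melanoma", "Non-Melanoma death")
)

# Same kind of table as before, rendered with the built-in zebra style
zebra_table <- table1(
  ~ factor(sex) + age | status,
  data = melanoma_styled,
  topclass = "Rtable1-zebra"
)
zebra_table
```

The other built-in styles follow the same naming pattern, for example "Rtable1-grid".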

In addition, you can also stratify the response variable to highlight two of the responses, like dead or alive in our example.
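One possible reading of this, sketched below, is to collapse the three status codes into two groups before building the table. This is an assumption about how one might do it, not the only approach: the vital_status column and the melanoma_two copy are hypothetical names, and both causes of death are pooled into a single “Dead” group.

```r
library(table1)
data(melanoma, package = "boot")

melanoma_two <- melanoma

# Collapse the status codes: 2 = alive; 1 and 3 = died (any cause)
melanoma_two$vital_status <- factor(
  ifelse(melanoma_two$status == 2, "Alive", "Dead"),
  levels = c("Alive", "Dead")
)

two_group_table <- table1(
  ~ factor(sex) + age + factor(ulcer) + thickness | vital_status,
  data = melanoma_two
)
two_group_table
```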

References

1. Hayes-Larson E, Kezios KL, Mooney SJ, Lovasi G. Who is in this study, anyway? Guidelines for a useful Table 1. Journal of Clinical Epidemiology [Internet] 2019;114:125–32. Available from: http://dx.doi.org/10.1016/j.jclinepi.2019.06.011
2. Davison AC, Hinkley DV. Bootstrap methods and their application [Internet]. Cambridge: Cambridge University Press; 1997. Available from: doi:10.1017/CBO9780511802843